I’m here to learn new skills and improve old ones. The course fits my needs perfectly. I have used R a lot, but I haven’t shared my code outside our research group, so learning good practices is very welcome. I went through Exercise 1 (‘warm up’) pretty fast and everything was fairly familiar. As a note to myself: the syntax and functions used were in many ways different from what I am used to for reaching the same goal. It might be good to start using fewer ‘intermediate’ objects and a more ‘tidyverse’ style. I like to use Atom for writing scripts, so it will be nice to try doing things differently.
My github: https://github.com/jvehvi/IODS-project
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Thu Dec 1 21:40:59 2022"
This week’s topic was regression and model validation. Here we present basic R commands for building summary tables and linear models, which can be used to explore relationships between variables (keeping in mind that regression alone does not establish cause and effect).
# Set up working directory
setwd("E:/Open science course 2022/IODS-project")
dir.create(paste0(getwd(),"/Data"))
## Warning in dir.create(paste0(getwd(), "/Data")): 'E:\Open science course
## 2022\IODS-project\Data' already exists
setwd(paste0(getwd(),"/Data"))
Read the .txt file, previously downloaded to the Data folder, into the R environment.
The table should have 183 rows and 60 columns, all containing integer values except the last column, which records gender as a character.
The data set contains summary results of the course survey ‘Johdatus yhteiskuntatilastotieteeseen, syksy 2014’. There should be 7 variables (gender, age, attitude, deep, stra, surf and points) and 166 observations.
Gender: Male = 1, Female = 2
Age: age (in years) derived from the date of birth
Attitude: global attitude toward statistics; mean of original variables (~Da+Db+Dc+Dd+De+Df+Dg+Dh+Di+Dj)
Deep: deep approach; mean of original variables (~d_sm+d_ri+d_ue)
Stra: strategic approach; mean of original variables (~st_os+st_tm)
Surf: surface approach; mean of original variables (~su_lp+su_um+su_sb)
Points: total points from the survey.
More information about the variables can be found from http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS2-meta.txt
## [1] 166 8
With the command summary() we can take a look ‘inside’ the data: it gives boxplot-style five-number information about the numeric columns of a data frame. A scatterplot matrix is used to describe relationships between the variables; it is constructed from the data frame with the ggpairs() function (GGally package, built on ggplot2). In addition to the pairwise relationships, the resulting plot shows each variable’s distribution and gives correlation coefficients with asterisks indicating the level of significance.
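A minimal sketch of these commands, using a small made-up stand-in data frame (the real data is the one summarized in the output below):

```r
# Toy stand-in for the learning data; values are invented for illustration
toy <- data.frame(
  Gender   = c("F", "M", "F", "M"),
  Attitude = c(3.2, 2.6, 4.1, 3.7),
  Points   = c(22, 19, 28, 25)
)

# summary() gives five-number ('boxplot') information per numeric column
summary(toy)

# ggpairs() from GGally would draw the scatterplot matrix:
# library(GGally)
# ggpairs(toy, lower = list(combo = wrap("facethist", bins = 20)))
```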
## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## Gender Age Attitude Deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.375
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.250
## Mode :character Median :22.00 Median :3.200 Median :3.750
## Mean :25.51 Mean :3.143 Mean :3.645
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.000
## Max. :55.00 Max. :5.000 Max. :4.875
## Stra Surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
The most promising relationships seem to be between Attitude and Points, and between Surf and Deep. There also seems to be some kind of relationship between Stra and Surf. Overall, the matrix gives a good overview of the data and a starting point for studying the relationships between variables further.
Create a regression model with multiple explanatory variables
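A hedged sketch of how such a model is fitted; here simulated stand-in data shows the mechanics, while the real fit below uses the learning_2014.2 data frame:

```r
set.seed(1)
# Simulated stand-in data: Points depends mostly on Attitude
n <- 100
sim <- data.frame(
  Attitude = runif(n, 1, 5),
  Stra     = runif(n, 1, 5),
  Surf     = runif(n, 1, 5)
)
sim$Points <- 11 + 3.4 * sim$Attitude + rnorm(n, sd = 5)

# Fit a linear model with three explanatory variables
m <- lm(Points ~ Attitude + Stra + Surf, data = sim)
summary(m)  # coefficient table, R-squared, F-statistic
```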
##
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = learning_2014.2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## Attitude 3.3952 0.5741 5.913 1.93e-08 ***
## Stra 0.8531 0.5416 1.575 0.11716
## Surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The summary of the regression model shows that only Attitude correlates significantly with Points. The print shows the model residuals and a coefficient table (estimate, standard error, t value and p-value for each variable in the model). The significance of each variable can be read from its p-value (the last column of the Coefficients table); the significance thresholds are given under the table. The summary also gives a p-value for the whole model (the F-test), which here is clearly significant even though the model contains variables that have no clear relationship to Points. In the next step, let’s remove the other variables from the model and see what happens.
##
## Call:
## lm(formula = Points ~ Attitude, data = learning_2014.2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## Attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Now the results look better, and the p-value is significant both for the variable and for the whole model. Multiple R-squared is the proportion of the variation in the dependent variable that can be explained by the independent variables. So in the model with three variables, 20.74 % of the variation in Points can be explained by the predictors. Interestingly, Attitude by itself explains 19.06 % of the variation, showing that the effect of Stra and Surf on Points is fairly minimal.
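That extra predictors can only nudge R-squared upward is easy to demonstrate: in ordinary least squares, multiple R-squared never decreases when a predictor is added, so comparing nested models shows how much each variable buys. A sketch on simulated data:

```r
set.seed(42)
# Simulated data: y depends on x1 only; x2 is pure noise
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 * d$x1 + rnorm(n)

r2_small <- summary(lm(y ~ x1,      data = d))$r.squared
r2_full  <- summary(lm(y ~ x1 + x2, data = d))$r.squared

# R-squared never decreases when a predictor is added,
# but the increase can be negligible (as with Stra and Surf here)
r2_full - r2_small
```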
From the ‘Residuals vs Leverage’ plot we can check which, and how many, observations are influential. In our case the data looks good: no point falls outside the Cook’s distance lines.
The ‘Residuals vs Fitted’ plot also looks good; the points are spread evenly along both axes.
The Q-Q plot likewise indicates the goodness of our model. If the points depart from the line too early, there might be other variables affecting the relationship more than Attitude. In this case the Q-Q plot looks quite nice, though not perfect.
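The diagnostic plots discussed above come from calling plot() on the fitted model object; a self-contained sketch with simulated data:

```r
set.seed(7)
# Simulated stand-in data and a simple linear fit
d <- data.frame(x = runif(50, 1, 5))
d$y <- 10 + 3 * d$x + rnorm(50, sd = 4)
fit <- lm(y ~ x, data = d)

# Residuals vs Fitted (1), Normal Q-Q (2), Residuals vs Leverage (5):
# plot(fit, which = c(1, 2, 5))

# The underlying quantities can also be inspected directly:
head(resid(fit))           # residuals
head(cooks.distance(fit))  # influence of each observation
```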
As study material, the Student Performance Data Set (incl. alcohol consumption) from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Student+Performance) has been used as the source data. The joined data set used in the analysis combines the two student alcohol consumption data sets from that page. The variables not used for joining the two data sets have been combined by averaging (including the grade variables):
+ ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’
+ ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
More information about the variables can be found from: https://archive.ics.uci.edu/ml/datasets/Student+Performance
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 0.3.5
## Warning: package 'tibble' was built under R version 4.2.2
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'dplyr' was built under R version 4.2.2
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# Read data table into the R environment
wrangled_student_mat_por <- as.data.frame(read_csv("Data/wrangled_student_mat_por.csv"))
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl (1): high_use
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(wrangled_student_mat_por)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
For studying the variables’ relationships to alcohol consumption, I have selected four variables that I hypothesize to be related to a person’s alcohol use.
age - I think that persons over 18 use more alcohol.
romantic - I think that single persons drink more often.
studytime - I think that persons who study more weekly drink less.
freetime - I think that persons who have more free time might drink more.
library(ggplot2)
library(GGally)
# Summary table
int_variables<-c("age","romantic","studytime","freetime","alc_use","high_use")
summary(wrangled_student_mat_por[int_variables])
## age romantic studytime freetime
## Min. :15.00 Length:370 Min. :1.000 Min. :1.000
## 1st Qu.:16.00 Class :character 1st Qu.:1.000 1st Qu.:3.000
## Median :17.00 Mode :character Median :2.000 Median :3.000
## Mean :16.58 Mean :2.043 Mean :3.224
## 3rd Qu.:17.00 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :22.00 Max. :4.000 Max. :5.000
## alc_use high_use
## Min. :1.000 Mode :logical
## 1st Qu.:1.000 FALSE:259
## Median :1.500 TRUE :111
## Mean :1.889
## 3rd Qu.:2.500
## Max. :5.000
# Unique values for romantic column
unique(wrangled_student_mat_por$romantic)
## [1] "no" "yes"
# Group by romantic before summarise
wrangled_student_mat_por %>% group_by(romantic) %>% summarise(count = n())
## # A tibble: 2 × 2
## romantic count
## <chr> <int>
## 1 no 251
## 2 yes 119
# Scatterplot
p <- ggpairs(wrangled_student_mat_por[int_variables], mapping = aes(), lower =list(combo = wrap("facethist", bins = 20)))
p
The results show that there seems to be some kind of relationship between alcohol use and at least freetime, studytime and age. A positive correlation can be seen between alcohol use and freetime, and negative correlations between alcohol use and studytime and between alcohol use and age. An interesting finding is that young persons (15-18) seem to drink more than those aged 19-21.
Use glm() to fit a logistic regression model with high_use as the target variable and freetime, studytime, age and romantic as the predictors.
m <- glm(high_use ~ freetime + studytime + age + factor(romantic), data = wrangled_student_mat_por, family = "binomial")
# Summary
summary(m)
##
## Call:
## glm(formula = high_use ~ freetime + studytime + age + factor(romantic),
## family = "binomial", data = wrangled_student_mat_por)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6628 -0.8579 -0.6654 1.1899 2.3332
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.4470 1.7762 -2.504 0.012293 *
## freetime 0.3269 0.1238 2.641 0.008274 **
## studytime -0.5850 0.1572 -3.721 0.000198 ***
## age 0.2246 0.1025 2.190 0.028512 *
## factor(romantic)yes -0.2158 0.2594 -0.832 0.405404
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 421.66 on 365 degrees of freedom
## AIC: 431.66
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.01171392 0.0003399901 0.3664590
## freetime 1.38664482 1.0914115708 1.7751196
## studytime 0.55710460 0.4053055149 0.7516885
## age 1.25176377 1.0261046822 1.5355541
## factor(romantic)yes 0.80586877 0.4806650258 1.3321754
The results show that the widest confidence interval is for the variable ‘romantic’. The interval also contains 1, so most probably alcohol consumption and romantic status have no relationship; rather, the variables appear independent of each other. freetime and age have odds ratios and confidence intervals > 1, so there seems to be a positive association between them and high_use. studytime’s values are < 1, meaning a negative association with high_use.
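These odds ratios are just the exponentiated coefficients of the logistic model, so they can be checked by hand from the summary table above:

```r
# Odds ratio = exp(coefficient) in logistic regression
exp(0.3269)   # freetime estimate from the summary above: OR about 1.39
exp(-0.5850)  # studytime: OR about 0.56 (below 1, negative association)
```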
Next, I have built a new model excluding ‘romantic’, which seems to have no relationship to students’ alcohol use. After building the model, I have taken the summary information of the model and calculated the mean prediction error.
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Let's select only freetime, studytime and age for the next model
m <- glm(high_use ~ freetime + studytime + age, data = wrangled_student_mat_por, family = "binomial")
# Summary
summary(m)
##
## Call:
## glm(formula = high_use ~ freetime + studytime + age, family = "binomial",
## data = wrangled_student_mat_por)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6143 -0.8690 -0.6895 1.2088 2.2716
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.2829 1.7668 -2.424 0.015345 *
## freetime 0.3246 0.1236 2.626 0.008630 **
## studytime -0.5928 0.1575 -3.765 0.000167 ***
## age 0.2119 0.1014 2.089 0.036700 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 422.36 on 366 degrees of freedom
## AIC: 430.36
##
## Number of Fisher Scoring iterations: 4
# Making predictions on the train set.
# Probability
wrangled_student_mat_por <- mutate(wrangled_student_mat_por, probability = predict(m, type = "response"))
# Prediction
wrangled_student_mat_por <- mutate(wrangled_student_mat_por, prediction = probability > 0.5)
# 2x2 cross tabulation of predictions versus the actual values
table(high_use = wrangled_student_mat_por$high_use, prediction = wrangled_student_mat_por$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 245 14
## TRUE 96 15
# Making predictions again on the same data (there is no separate test set here)
test_pred<- ifelse(predict(m, newdata = wrangled_student_mat_por, type = "response") > 0.5, "TRUE", "FALSE")
# 2x2 cross tabulation of predictions versus the actual values
table(high_use = wrangled_student_mat_por$high_use, prediction = test_pred)
## prediction
## high_use FALSE TRUE
## FALSE 245 14
## TRUE 96 15
# We can study the goodness of our model further by looking at the confusion matrix of the results
confusionMatrix(table(high_use = wrangled_student_mat_por$high_use, prediction = test_pred))
## Confusion Matrix and Statistics
##
## prediction
## high_use FALSE TRUE
## FALSE 245 14
## TRUE 96 15
##
## Accuracy : 0.7027
## 95% CI : (0.6533, 0.7488)
## No Information Rate : 0.9216
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1028
##
## Mcnemar's Test P-Value : 1.136e-14
##
## Sensitivity : 0.7185
## Specificity : 0.5172
## Pos Pred Value : 0.9459
## Neg Pred Value : 0.1351
## Prevalence : 0.9216
## Detection Rate : 0.6622
## Detection Prevalence : 0.7000
## Balanced Accuracy : 0.6179
##
## 'Positive' Class : FALSE
##
# Prediction plot
# access dplyr and ggplot2
library(dplyr); library(ggplot2)
# initialize a plot of 'high_use' versus 'probability' in 'wrangled_student_mat_por'
# define the geom as points and draw the plot
g <- ggplot(wrangled_student_mat_por, aes(x = high_use, y =as.numeric(probability))) +
geom_point(aes(y = as.numeric(probability)), alpha = 0.2)
# ROC curve
library('pROC')
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
test_prob <- predict(m, newdata = wrangled_student_mat_por, type = "response")
test_roc <- roc(wrangled_student_mat_por$high_use ~ test_prob, plot = TRUE, print.auc = TRUE)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
test_roc
##
## Call:
## roc.formula(formula = wrangled_student_mat_por$high_use ~ test_prob, plot = TRUE, print.auc = TRUE)
##
## Data: test_prob in 259 controls (wrangled_student_mat_por$high_use FALSE) < 111 cases (wrangled_student_mat_por$high_use TRUE).
## Area under the curve: 0.6808
# Define a loss function (mean prediction error)
calc_class_err <- function(actual, predicted) {
mean(actual != predicted)
}
# call calc_class_err to compute the proportion of wrong predictions in the (training) data
calc_class_err(actual = wrangled_student_mat_por$high_use, predicted = wrangled_student_mat_por$prediction)
## [1] 0.2972973
# call calc_class_err to compute the proportion of wrong predictions in the (test) data
# (here identical to the training data, so the error rates necessarily match)
calc_class_err(actual = wrangled_student_mat_por$high_use, predicted = test_pred)
## [1] 0.2972973
Our results show that approximately 30 % of our predictions are wrong, so the accuracy of our model is 0.70, which is at least better than a pure guess. We selected 0.5 as the classification threshold; that value is our own choice, and how it affects accuracy can be studied from the ROC curve. In this study it is acceptable. The ‘test’ error rate equals the training error rate because the predictions were made on the same data, so this comparison does not yet say anything about generalization. When the predictions are compared to high_use, we can see that our predictions contain too many low users compared to the real situation and, conversely, too few high users.
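The headline figures can be recomputed by hand from the 2x2 table above:

```r
# Counts taken from the cross tabulation above
TN <- 245; FP <- 14; FN <- 96; TP <- 15
n  <- TN + FP + FN + TP          # 370 students

accuracy   <- (TN + TP) / n      # correct predictions
error_rate <- (FP + FN) / n      # the mean prediction error
accuracy
error_rate
```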
Let’s perform 10-fold cross-validation on our model to see whether we get a better estimate of test set performance. 10-fold means that all observations are divided into 10 bins (folds); each fold serves once as validation data while the model is trained on the other nine.
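The fold assignment itself is simple; a manual sketch of splitting the 370 observations into 10 folds (caret’s trainControl below does this internally):

```r
set.seed(123)
n_obs <- 370
# Randomly assign each observation to one of 10 folds
folds <- sample(rep(1:10, length.out = n_obs))
table(folds)  # each fold gets 37 observations

# One CV round: hold out fold 1, train on the rest
test_idx  <- which(folds == 1)
train_idx <- which(folds != 1)
```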
library(caret)
#specify the cross-validation method
ctrl <- trainControl(method = "cv", number = 10)
#fit a regression model and use k-fold CV to evaluate performance
model <- train(factor(high_use) ~ freetime + studytime + age, data = wrangled_student_mat_por, family = "binomial", method = "glm", trControl = ctrl)
model
## Generalized Linear Model
##
## 370 samples
## 3 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 333, 333, 334, 333, 333, 333, ...
## Resampling results:
##
## Accuracy Kappa
## 0.6975383 0.09043105
# Show also summary of model
summary(model)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6143 -0.8690 -0.6895 1.2088 2.2716
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.2829 1.7668 -2.424 0.015345 *
## freetime 0.3246 0.1236 2.626 0.008630 **
## studytime -0.5928 0.1575 -3.765 0.000167 ***
## age 0.2119 0.1014 2.089 0.036700 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 422.36 on 366 degrees of freedom
## AIC: 430.36
##
## Number of Fisher Scoring iterations: 4
#View final model parameters
model$finalModel
##
## Call: NULL
##
## Coefficients:
## (Intercept) freetime studytime age
## -4.2829 0.3246 -0.5928 0.2119
##
## Degrees of Freedom: 369 Total (i.e. Null); 366 Residual
## Null Deviance: 452
## Residual Deviance: 422.4 AIC: 430.4
The results show that our accuracy is approximately 0.70, meaning that 30 % of our predictions are wrong. The results are almost the same as with the plain logistic regression model, so in this case 10-fold cross-validation does not change the conclusions. Typically, as k gets larger, the difference in size between the training set and the resampling subsets gets smaller; as this difference decreases, the bias of the technique becomes smaller, but there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typical choices are k = 10 or k = 5, because these have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance. According to the model’s summary table, studytime has the most negative effect on alcohol use, freetime has the strongest positive association with high_use, and age also has a significant effect on drinking habits. Basically, the results seem to be in line with the earlier results.
For testing and comparing the performance of different models, I have selected 10 variables as a starting point and removed them one by one. The selected variables are:
* famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
* Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
* Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
* traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
* famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
* sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
* health - current health status (numeric: from 1 - very bad to 5 - very good)
* freetime - free time after school (numeric: from 1 - very low to 5 - very high)
* studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
* age - student’s age (numeric: from 15 to 22)
The data has been divided into test and training sets, so that we can better understand how the model generalizes.
# select only columns under interest
wrangled_student_mat_por_sub <- wrangled_student_mat_por[,colnames(wrangled_student_mat_por) %in% c("famsize","Medu","Fedu","traveltime","famrel","sex","health","freetime","studytime","age","high_use")]
# Divide data to test and training sets by using 70 % of the data for the training set
dat <- list()
dat[['index']] <- sample(nrow(wrangled_student_mat_por_sub), size = nrow(wrangled_student_mat_por_sub) * 0.7)
dat[['training']] <- wrangled_student_mat_por_sub[dat$index, ]
dat[['test']] <- wrangled_student_mat_por_sub[-dat$index,]
# Empty lists
count_of_variables <- test_mses <- training_mses <- list()
# For loop to test different counts of variables
for (i in 1:(ncol(wrangled_student_mat_por_sub)- 1)) {
# Build new model every iteration containing one variable less
train_mod <- glm(high_use ~ ., data = dat$training[, i:ncol(dat$training)], family = "binomial")
# Predict values for training data
y_hat_training <- predict(train_mod, type = "response")
train_predict = y_hat_training > 0.5
# Collect training error
training_mses[[i]] <- calc_class_err(actual = dat$training$high_use, predicted = train_predict)
# Predict values for testing data
y_hat_test <- predict(train_mod, newdata = dat$test, type = "response")
test_predict = y_hat_test > 0.5
test_mses[[i]] <- calc_class_err(actual = dat$test$high_use, predicted = test_predict)
count_of_variables[[i]] <- length(train_mod$coefficients) - 1
}
# Build dataframe from results to be used in ggplot
df<-data.frame(Number_of_Variables=unlist(count_of_variables), Test_error=unlist(test_mses),
Training_error=unlist(training_mses))
# Build plot
library("reshape2")
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
df_long<-melt(df, id="Number_of_Variables",)
df_long$Number_of_Variables<-factor(df_long$Number_of_Variables, levels=df$Number_of_Variables)
a<-ggplot(df_long, aes(x = Number_of_Variables,y=value, color=variable, group=variable)) +
geom_point() +
geom_line() +
geom_smooth() +
scale_x_discrete(limits = rev(levels(df$Number_of_Variables)))
a
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
In the plot, test and training errors are shown as separate lines against the number of predictor variables in the model. I selected ten variables that might affect alcohol consumption. Typically, a training error clearly lower than the test error indicates that the model overfits the training data; the model can be considered good enough when the test and training errors are close to each other. For example, the plot shows that the training error first rises in the range of 10-8 variables, while in the same range the test error stays lower and remains almost unchanged; overfitting can be seen in the ranges 7-5 and 3-1. When we have four variables in the model, the test and training errors are close to each other, indicating a reasonable model. These results change depending on how the data is divided into test and training sets. According to these results, a logistic regression model with four variables seems good if the selected variables are good; different groupings of the selected variables should also be tested to find out which four variables are the best.
This assignment’s topic is clustering and classification. In essence, clustering means that some points/observations lie so close to each other in the data space, compared to the other points, that they form a cluster of their own. There are multiple clustering methods for finding such clusters from the data. One of the most common is k-means clustering; others worth naming are the hierarchical clustering methods, which give dendrograms as output. If the clustering succeeds, new observations can then be classified into those clusters.
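As a minimal self-contained illustration of the idea, base R’s kmeans() recovers the groups in simulated two-cluster data:

```r
set.seed(1)
# Two well-separated simulated point clouds in 2-D
pts <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

# k-means with k = 2 should recover the two clouds
km <- kmeans(pts, centers = 2, nstart = 10)
table(km$cluster)  # cluster sizes, roughly a 50/50 split
```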
To clarify this topic, Boston dataset from MASS - package will be used as a demonstration dataset.
The Boston Dataset
The Boston Housing data set is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The Boston data frame has 506 rows and 14 columns. The following describes the columns (from the MASS package documentation):
crim - per capita crime rate by town
zn - proportion of residential land zoned for lots over 25,000 sq.ft.
indus - proportion of non-retail business acres per town
chas - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
nox - nitrogen oxides concentration (parts per 10 million)
rm - average number of rooms per dwelling
age - proportion of owner-occupied units built prior to 1940
dis - weighted mean of distances to five Boston employment centres
rad - index of accessibility to radial highways
tax - full-value property-tax rate per $10,000
ptratio - pupil-teacher ratio by town
black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat - lower status of the population (percent)
medv - median value of owner-occupied homes in $1000s
# Bring needed libraries to R-environment
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2)
library(GGally)
library(reshape2)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# Check head of Boston table
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# Check dimensions
dim(Boston)
## [1] 506 14
# Check summary of data
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# Visualize data for inspecting
## Scatterplot
ggpairs(Boston, mapping = aes(), lower =list(combo = wrap("facethist", bins = 20)))
## All variables against crim rate
bosmelt <- melt(Boston, id="crim")
ggplot(bosmelt, aes(x=value, y=crim))+
facet_wrap(~variable, scales="free")+
geom_point()
# Correlation matrix and visualization
## Matrix
corrmatrix <- cor(Boston)
corrmatrix
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## black lstat medv
## crim -0.38506394 0.4556215 -0.3883046
## zn 0.17552032 -0.4129946 0.3604453
## indus -0.35697654 0.6037997 -0.4837252
## chas 0.04878848 -0.0539293 0.1752602
## nox -0.38005064 0.5908789 -0.4273208
## rm 0.12806864 -0.6138083 0.6953599
## age -0.27353398 0.6023385 -0.3769546
## dis 0.29151167 -0.4969958 0.2499287
## rad -0.44441282 0.4886763 -0.3816262
## tax -0.44180801 0.5439934 -0.4685359
## ptratio -0.17738330 0.3740443 -0.5077867
## black 1.00000000 -0.3660869 0.3334608
## lstat -0.36608690 1.0000000 -0.7376627
## medv 0.33346082 -0.7376627 1.0000000
## Visualization
corrplot(corrmatrix, method="circle", cl.pos = "b", tl.pos = "d", tl.cex = 0.6, type = "upper")
There are 506 observations of 14 variables, which summarise information about housing in the Boston, Massachusetts area. Using this dataset, we can look for relationships between the variables. For example, the relationship of the per capita crime rate to the other variables is very interesting: if we could identify the variables with the greatest effect on it, we could try to influence those variables and thereby reduce the growth of the crime rate when planning new residential areas. From the summary table we can see that not all variables are easy or wise to use as predictors in the same model in their current form. For example, the observations of "proportion of residential land zoned for lots over 25,000 sq.ft." (zn) are very unevenly distributed, whereas those of "pupil-teacher ratio by town" (ptratio) are much more evenly distributed. Interestingly, when we plot the other variables against the crime rate, we can see possible correlations even where the distributions are not so clear. From the correlation matrix we can see that the per capita crime rate has some degree of positive or negative correlation with every other variable: six variables correlate negatively and seven positively. The weakest correlation is with "Charles River dummy variable (1 if tract bounds river; 0 otherwise)" (chas), and overall the impact of the negatively correlated factors seems to be low. The impact of the positively correlated factors seems to be much higher, and "index of accessibility to radial highways" (rad) and "full-value property-tax rate per $10,000" (tax) have the strongest positive correlation with the per capita crime rate. The same results are visualized in the plot, where conclusions can be drawn more easily.
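To read the crim column of the matrix more easily, the correlations can also be sorted; a small sketch using the `corrmatrix` object computed above:

```r
# Rank the other variables by their correlation with per capita crime rate
crim_cor <- sort(corrmatrix["crim", colnames(corrmatrix) != "crim"],
                 decreasing = TRUE)
round(crim_cor, 2)   # rad and tax at the top, chas closest to zero
```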
To get a clearer understanding of the data and to make it usable in a prediction model, it has to be standardized. To do that, we scale and center the columns of the numeric matrix: centering is done by subtracting the column means (omitting NAs) from the corresponding columns, and scaling is done by dividing the (centered) columns by their standard deviations.
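As a sanity check, `scale(x, center = TRUE, scale = TRUE)` is equivalent to subtracting the mean and dividing by the standard deviation; a minimal sketch with made-up numbers:

```r
# Verify that scale() matches manual centering and scaling
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)   # TRUE
```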
# Check head of original data to compare scaled data
data("Boston")
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# center and standardize variables, also set output to be data.frame
boston_scaled <- as.data.frame(scale(Boston, center = TRUE, scale = TRUE))
head(boston_scaled)
## crim zn indus chas nox rm age
## 1 -0.4193669 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
## dis rad tax ptratio black lstat medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990 0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324 1.3229375
## 4 1.076671 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708 1.1815886
## 5 1.076671 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.0254866 1.4860323
## 6 1.076671 -0.7521778 -1.1050216 0.1129203 0.4101651 -1.0422909 0.6705582
# Set 'crim' to numeric variable
boston_scaled$crim <- as.numeric(boston_scaled$crim)
# Summary of crim variable
summary(boston_scaled[,'crim'])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# class of the boston_scaled object
class(boston_scaled)
## [1] "data.frame"
# Inspect how the values have changed
# Check summary of data
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# Visualize data for inspecting
## Scatterplot
ggpairs(boston_scaled, mapping = aes(), lower =list(combo = wrap("facethist", bins = 20)))
## All variables against crim rate
bosmelt <- melt(boston_scaled, id="crim")
ggplot(bosmelt, aes(x=value, y=crim))+
facet_wrap(~variable, scales="free")+
geom_point()
# Correlation matrix and visualization
## Matrix
corrmatrix <- cor(boston_scaled)
corrmatrix
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## black lstat medv
## crim -0.38506394 0.4556215 -0.3883046
## zn 0.17552032 -0.4129946 0.3604453
## indus -0.35697654 0.6037997 -0.4837252
## chas 0.04878848 -0.0539293 0.1752602
## nox -0.38005064 0.5908789 -0.4273208
## rm 0.12806864 -0.6138083 0.6953599
## age -0.27353398 0.6023385 -0.3769546
## dis 0.29151167 -0.4969958 0.2499287
## rad -0.44441282 0.4886763 -0.3816262
## tax -0.44180801 0.5439934 -0.4685359
## ptratio -0.17738330 0.3740443 -0.5077867
## black 1.00000000 -0.3660869 0.3334608
## lstat -0.36608690 1.0000000 -0.7376627
## medv 0.33346082 -0.7376627 1.0000000
## Visualization
corrplot(corrmatrix, method="circle", cl.pos = "b", tl.pos = "d", tl.cex = 0.6, type = "upper")
After scaling, the values of the different variables lie on a much more comparable range. The correlation matrix gives the same results as with the raw values, since correlation is invariant to linear scaling.
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled[['crim']])
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime' from the quantile breaks
crime <- cut(boston_scaled[['crim']], breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
# check the data after creating the factor
head(boston_scaled)
## crim zn indus chas nox rm age
## 1 -0.4193669 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100
## dis rad tax ptratio black lstat medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990 0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324 1.3229375
## 4 1.076671 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708 1.1815886
## 5 1.076671 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.0254866 1.4860323
## 6 1.076671 -0.7521778 -1.1050216 0.1129203 0.4101651 -1.0422909 0.6705582
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# Relabel the 'crime' factor levels from lowest to highest crime rate
boston_scaled$crime <- factor(boston_scaled$crime, labels = c("low", "med_low", "med_high", "high"))
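Since the breaks are the quartiles, each class of the new factor should hold roughly a quarter of the observations; a small sketch with simulated data (the class names mirror the ones above):

```r
# Cutting at the quartiles gives four classes of (nearly) equal size
set.seed(1)
x <- rnorm(100)
b <- cut(x, breaks = quantile(x), include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(b)   # 25 observations per class
```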
# Bring the ready-made scaled Boston data table (with the 'crime' factor) from GitHub to R; this replaces the version built above
boston_scaled <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt", sep=",", header = T)
# Change 'crime' to factor variable with relevant levels
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
# Build up test and train datasets
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
dim(train)
## [1] 404 14
# create test set
test <- boston_scaled[-ind,]
dim(test)
## [1] 102 14
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
head(test)
## zn indus chas nox rm age dis
## 3 -0.48724019 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490 0.556609
## 4 -0.48724019 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878 1.076671
## 8 0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.1603069 0.9778406 1.023625
## 9 0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.9302853 1.1163897 1.086122
## 10 0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.3994130 0.6154813 1.328320
## 11 0.04872402 -0.4761823 -0.2723291 -0.2648919 0.1314594 0.9138948 1.211780
## rad tax ptratio black lstat medv
## 3 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324 1.3229375
## 4 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708 1.1815886
## 8 -0.5224844 -0.5769480 -1.5037485 0.4406159 0.9097999 0.4965904
## 9 -0.5224844 -0.5769480 -1.5037485 0.3281233 2.4193794 -0.6559463
## 10 -0.5224844 -0.5769480 -1.5037485 0.3289995 0.6227277 -0.3949946
## 11 -0.5224844 -0.5769480 -1.5037485 0.3926395 1.0918456 -0.8190411
Now we have built training and test sets by randomly selecting 80 % of the data for training and keeping the remaining 20 % for testing. Both sets will be used in the next part.
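One caveat: `sample()` is random, so the split (and the numbers below) change every time the document is knitted. A small sketch of a reproducible version (the seed 2022 is an arbitrary choice of mine):

```r
# Fix the RNG seed so the 80/20 split is reproducible across knits
set.seed(2022)
n <- 506                          # rows in boston_scaled
ind <- sample(n, size = floor(n * 0.8))
length(ind)                       # 404 training rows, 102 left for testing
```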
Next we use the training set in linear discriminant analysis. The function tries hard to detect if the within-class covariance matrix is singular: if any variable has within-group variance less than tol^2, it will stop and report the variable as constant. The function returns:
* prior: the prior probabilities used.
* means: the group means.
* scaling: a matrix which transforms observations to discriminant functions, normalized so that the within-groups covariance matrix is spherical.
* svd: the singular values, which give the ratio of the between- and within-group standard deviations on the linear discriminant variables. Their squares are the canonical F-statistics.
* N: the number of observations used.
* call: the (matched) function call.
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2524752 0.2524752 0.2450495
##
## Group means:
## zn indus chas nox rm age
## low 0.95211557 -0.9224255 -0.07742312 -0.8934226 0.47539272 -0.8760937
## med_low -0.09293868 -0.3053227 0.03646311 -0.5566211 -0.16425574 -0.3657515
## med_high -0.37626407 0.2071130 0.22945822 0.4712630 0.04649692 0.3955372
## high -0.48724019 1.0171737 -0.07348562 1.0540712 -0.46140341 0.8090769
## dis rad tax ptratio black lstat
## low 0.8824421 -0.6941859 -0.7625874 -0.44456744 0.37759521 -0.76041803
## med_low 0.3005443 -0.5348698 -0.4144773 -0.01070727 0.30790543 -0.16756510
## med_high -0.3863936 -0.4098896 -0.2839422 -0.30641729 0.06181208 0.05793277
## high -0.8488579 1.6375616 1.5136504 0.78011702 -0.89231005 0.90714061
## medv
## low 0.57948348
## med_low -0.01475981
## med_high 0.15750243
## high -0.66813722
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12444212 0.61822554 -0.91680163
## indus 0.03020150 -0.31889198 0.01759476
## chas -0.08680234 -0.03321606 0.15026279
## nox 0.34716661 -0.91868215 -1.21008824
## rm -0.09602797 -0.08703451 -0.23591304
## age 0.28322899 -0.25784030 -0.12528367
## dis -0.03900419 -0.34277578 -0.10008733
## rad 3.22404579 0.82751604 -0.50217623
## tax -0.03281620 0.20860867 1.03728043
## ptratio 0.10710110 -0.01925018 -0.26626352
## black -0.13765518 0.03113053 0.12156177
## lstat 0.26514570 -0.23990031 0.08409575
## medv 0.22554308 -0.40781608 -0.31189379
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9476 0.0392 0.0133
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results and overlay the biplot arrows (base graphics: two calls, not `+`)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
From the results plot we can see that rad seems to be the most influential variable for LD1 in our training set. The result object confirms this: rad has clearly the largest positive LD1 coefficient (3.224), whereas the second most influential variable, nox, has only 0.347. For LD2, zn has the highest positive coefficient (0.618). The proportion of trace shows that LD1 alone captures about 95 % of the between-group variance.
## Predict the classes with the LDA model on the test data
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
lda.pred
## $class
## [1] med_low low med_high med_high med_low med_low med_high med_high
## [9] med_high med_high med_high med_low med_low med_low med_low low
## [17] med_low low low med_low med_low med_low med_low med_low
## [25] low low med_low med_high med_high med_high med_high med_high
## [33] med_high med_high med_high med_high med_high med_high med_low med_high
## [41] low low low med_low med_low med_high med_high med_low
## [49] med_high low med_low low med_high med_high low low
## [57] low low low med_low med_low med_low med_low med_low
## [65] med_low low med_low med_low med_low low low low
## [73] low high high high high high high high
## [81] high high high high high high high high
## [89] high high high high high high high high
## [97] high high high high high high
## Levels: low med_low med_high high
##
## $posterior
## low med_low med_high high
## 3 4.211994e-01 4.587392e-01 1.200614e-01 1.358990e-20
## 4 5.141024e-01 4.108262e-01 7.507143e-02 4.370198e-20
## 8 3.277469e-02 2.498857e-01 7.173396e-01 1.562811e-13
## 9 1.980794e-02 3.478541e-01 6.323380e-01 1.298565e-12
## 10 8.418955e-02 4.944640e-01 4.213465e-01 1.406582e-14
## 11 7.524110e-02 4.854408e-01 4.393181e-01 2.185426e-14
## 21 5.674117e-02 4.602501e-01 4.830087e-01 4.477273e-14
## 22 9.840338e-02 4.242424e-01 4.773542e-01 4.750005e-15
## 24 5.885732e-02 4.212888e-01 5.198539e-01 2.533772e-14
## 28 8.195629e-02 4.100854e-01 5.079583e-01 1.173977e-14
## 30 1.195788e-01 3.486488e-01 5.317725e-01 1.562050e-15
## 44 4.405147e-01 5.484236e-01 1.106171e-02 1.289562e-21
## 46 3.114168e-01 6.753202e-01 1.326302e-02 1.922428e-20
## 48 1.360379e-01 7.622334e-01 1.017287e-01 8.802600e-18
## 52 4.804744e-01 4.830963e-01 3.642927e-02 4.149627e-18
## 53 6.783679e-01 3.106705e-01 1.096158e-02 3.968083e-20
## 64 3.614625e-01 3.662636e-01 2.722739e-01 1.362144e-12
## 66 9.741601e-01 2.569121e-02 1.487166e-04 1.698401e-20
## 67 9.517301e-01 4.796013e-02 3.097954e-04 3.272199e-19
## 69 2.371419e-01 7.563206e-01 6.537557e-03 5.537562e-19
## 71 3.076936e-01 6.851463e-01 7.160066e-03 1.288919e-20
## 80 1.054094e-01 8.819486e-01 1.264202e-02 6.632733e-18
## 91 3.810787e-01 5.867655e-01 3.215585e-02 2.856517e-20
## 92 3.725278e-01 5.921308e-01 3.534133e-02 4.072391e-20
## 93 5.970963e-01 3.556433e-01 4.726040e-02 1.507500e-17
## 98 4.606045e-01 4.466293e-01 9.276620e-02 2.393013e-20
## 114 2.975081e-02 5.933840e-01 3.768651e-01 1.807612e-12
## 121 6.424236e-02 1.760892e-01 7.596685e-01 1.024426e-17
## 123 2.420528e-02 9.942175e-02 8.763730e-01 9.404303e-17
## 125 2.832946e-02 1.173103e-01 8.543602e-01 9.431397e-17
## 126 3.919243e-02 1.244056e-01 8.364019e-01 3.513231e-17
## 127 1.151822e-02 9.612258e-02 8.923592e-01 1.154720e-15
## 129 1.347589e-02 1.406077e-01 8.459164e-01 3.756613e-14
## 136 1.016215e-02 1.238554e-01 8.659825e-01 5.510920e-14
## 159 2.525138e-02 3.102094e-01 6.645393e-01 3.775773e-14
## 161 2.631021e-02 4.424886e-01 5.312012e-01 1.552491e-15
## 165 1.858497e-02 3.298225e-01 6.515926e-01 4.149292e-14
## 168 1.553770e-02 2.528151e-01 7.316472e-01 1.478281e-13
## 176 3.231518e-01 6.041394e-01 7.270875e-02 2.779458e-17
## 183 3.073912e-01 3.120160e-01 3.805928e-01 2.234652e-17
## 188 7.147070e-01 2.779073e-01 7.385780e-03 3.934097e-17
## 192 7.347587e-01 2.580935e-01 7.147713e-03 4.539711e-18
## 200 9.895160e-01 1.029264e-02 1.913555e-04 4.923326e-21
## 216 2.426936e-01 6.609542e-01 9.635215e-02 8.582583e-18
## 223 5.631499e-02 5.559637e-01 3.877213e-01 1.299867e-12
## 227 5.168594e-02 1.603962e-01 7.879179e-01 8.591546e-12
## 228 6.511484e-02 3.046587e-01 6.302264e-01 1.753444e-11
## 231 6.275524e-02 5.480207e-01 3.892241e-01 2.783074e-11
## 233 4.605319e-02 1.158552e-01 8.380916e-01 3.200293e-12
## 241 5.048382e-01 4.536864e-01 4.147541e-02 1.151032e-15
## 242 3.931933e-01 5.619163e-01 4.489036e-02 6.340494e-15
## 253 5.402685e-01 4.097509e-01 4.998058e-02 2.736426e-16
## 264 5.504426e-02 4.454125e-02 9.004145e-01 1.113982e-13
## 268 8.204431e-02 4.574304e-02 8.722127e-01 1.117761e-14
## 276 8.706694e-01 1.159727e-01 1.335792e-02 1.797347e-18
## 279 8.414633e-01 1.478457e-01 1.069099e-02 2.402943e-18
## 293 9.887469e-01 1.076785e-02 4.852013e-04 3.417498e-19
## 299 9.319102e-01 6.762165e-02 4.681977e-04 2.880856e-19
## 307 5.936614e-01 1.518050e-01 2.545336e-01 1.807171e-12
## 309 2.214764e-01 4.839496e-01 2.945740e-01 8.014209e-17
## 312 2.831445e-01 5.942785e-01 1.225770e-01 1.800927e-17
## 318 9.310826e-02 5.506103e-01 3.562814e-01 1.069044e-15
## 319 1.651242e-01 5.164268e-01 3.184489e-01 1.727875e-16
## 327 3.214547e-01 6.011099e-01 7.743537e-02 1.490418e-17
## 328 1.780388e-01 6.542589e-01 1.677023e-01 4.213308e-16
## 333 8.080054e-01 1.908356e-01 1.158907e-03 8.082961e-24
## 335 3.610689e-01 4.408628e-01 1.980683e-01 6.112855e-17
## 338 2.315106e-01 4.922032e-01 2.762862e-01 1.200000e-15
## 343 2.441902e-01 7.448982e-01 1.091160e-02 5.164821e-23
## 344 8.477175e-01 1.320333e-01 2.024918e-02 3.730604e-16
## 345 8.882847e-01 9.465187e-02 1.706339e-02 4.849892e-17
## 348 9.787710e-01 2.045101e-02 7.779644e-04 2.734075e-19
## 351 8.690665e-01 1.284278e-01 2.505754e-03 4.049972e-23
## 358 9.743557e-18 8.367889e-15 6.065333e-11 1.000000e+00
## 359 2.305954e-17 1.855313e-14 1.034222e-10 1.000000e+00
## 376 7.129814e-17 3.402066e-14 3.074781e-11 1.000000e+00
## 377 1.088890e-18 1.234554e-15 1.568635e-12 1.000000e+00
## 378 2.924621e-18 2.866123e-15 3.218853e-12 1.000000e+00
## 379 9.403818e-19 1.246374e-15 1.333531e-12 1.000000e+00
## 381 6.101674e-17 3.813752e-14 2.081351e-11 1.000000e+00
## 382 3.884952e-18 4.056041e-15 3.308111e-12 1.000000e+00
## 389 4.863046e-21 1.695979e-17 3.157437e-14 1.000000e+00
## 390 3.758685e-19 5.614935e-16 6.697686e-13 1.000000e+00
## 395 5.065522e-18 4.597094e-15 4.167203e-12 1.000000e+00
## 407 5.018051e-20 1.648098e-16 6.273814e-14 1.000000e+00
## 408 2.810692e-19 3.073889e-16 6.144495e-13 1.000000e+00
## 418 6.536867e-21 1.421226e-17 2.458765e-14 1.000000e+00
## 423 5.403824e-18 6.385729e-15 2.275880e-12 1.000000e+00
## 428 9.887512e-19 6.083648e-16 7.277140e-13 1.000000e+00
## 434 1.630687e-19 1.080837e-16 5.468461e-13 1.000000e+00
## 438 4.388141e-22 6.033055e-19 9.430585e-15 1.000000e+00
## 441 3.045722e-19 3.403819e-16 1.009983e-12 1.000000e+00
## 446 2.008389e-21 2.066226e-18 3.395280e-14 1.000000e+00
## 447 2.679520e-19 2.036694e-16 1.290164e-12 1.000000e+00
## 452 1.156314e-18 8.802876e-16 3.401562e-12 1.000000e+00
## 453 1.897310e-18 1.534732e-15 4.352007e-12 1.000000e+00
## 471 9.686888e-17 1.226858e-13 2.078335e-11 1.000000e+00
## 473 2.835662e-16 2.700301e-13 4.911484e-11 1.000000e+00
## 474 1.931643e-16 1.106942e-13 7.378198e-11 1.000000e+00
## 475 1.252543e-17 2.352910e-14 2.646899e-12 1.000000e+00
## 479 1.574242e-17 1.990977e-14 5.538831e-12 1.000000e+00
## 481 9.498176e-15 8.317935e-12 3.053743e-10 1.000000e+00
##
## $x
## LD1 LD2 LD3
## 3 -3.3785177 -0.792729843 -0.004624797
## 4 -3.2380507 -0.336016676 -0.081271520
## 8 -1.3334814 -1.065733706 0.081938183
## 9 -1.0669415 -0.951492474 0.712716509
## 10 -1.6868778 -0.626937861 0.393617965
## 11 -1.6288080 -0.647765321 0.422100615
## 21 -1.5283135 -0.736664189 0.495835668
## 22 -1.8197083 -0.758035692 0.125736247
## 24 -1.5952665 -0.828046826 0.371639239
## 28 -1.7028270 -0.768769798 0.167713527
## 30 -1.9557366 -0.869483065 -0.200279979
## 44 -3.6164652 0.291695527 1.004565178
## 46 -3.2912398 0.355387320 1.309166785
## 48 -2.5705111 -0.427438153 1.121403854
## 52 -2.6900811 0.531460861 0.302936560
## 53 -3.2139452 0.807084880 0.158885597
## 64 -1.2091880 0.663527172 -0.671902765
## 66 -3.1435679 3.101868210 -0.931201400
## 67 -2.8347325 3.037029362 -0.616873235
## 69 -2.8704862 1.003159877 1.779677782
## 71 -3.3255247 0.647338390 1.550826475
## 80 -2.5525617 0.595149389 2.144377323
## 91 -3.2684339 -0.016250363 0.739948481
## 92 -3.2277367 -0.038613567 0.724259177
## 93 -2.5432627 0.597612247 -0.228454768
## 98 -3.3105418 -0.556260924 -0.001337956
## 114 -1.0620828 -0.445341856 1.158184472
## 121 -2.4918810 -1.883045887 -0.539215208
## 123 -2.1556467 -2.135879062 -0.586843400
## 125 -2.1703974 -2.051585851 -0.513140329
## 126 -2.3068235 -2.015184448 -0.628776857
## 127 -1.8177554 -2.170840038 -0.218724534
## 129 -1.4311934 -1.687399176 0.022237225
## 136 -1.3655251 -1.776491647 0.055095460
## 159 -1.4933068 -1.268445540 0.490626052
## 161 -1.8811687 -1.454504793 0.928979423
## 165 -1.4670949 -1.365300995 0.737209915
## 168 -1.2989912 -1.376059221 0.530935158
## 176 -2.4673407 0.210229916 0.485328423
## 183 -2.4972096 -0.774887135 -0.692224271
## 188 -2.3921629 1.779882774 0.062661082
## 192 -2.6441727 1.573419154 0.016919327
## 200 -3.2579134 2.792218698 -1.887190586
## 216 -2.5990440 -0.179144781 0.659860035
## 223 -1.1346472 -0.248102628 0.710948365
## 227 -0.8712314 -0.529220727 -0.702230051
## 228 -0.8219194 -0.207806425 -0.157252364
## 231 -0.7796033 0.120618093 0.588678462
## 233 -0.9689292 -0.730380111 -0.952897666
## 241 -2.0308526 1.080429894 0.089639126
## 242 -1.8263773 1.131980562 0.391785403
## 253 -2.2034139 0.845464680 -0.093298653
## 264 -1.3364876 -1.106290593 -1.952627669
## 268 -1.6299522 -1.177060449 -2.121357169
## 276 -2.7428609 1.158310523 -1.055396480
## 279 -2.7121294 1.310233579 -0.728909174
## 293 -2.7798099 2.737727554 -2.227288505
## 299 -2.8711124 2.804666568 -0.417660857
## 307 -1.1663181 0.883731162 -1.790919883
## 309 -2.3411881 -0.604840714 -0.005319465
## 312 -2.5210674 -0.176345422 0.371999256
## 318 -1.9967948 -0.766463336 0.531194937
## 319 -2.2385383 -0.677752189 0.193620455
## 327 -2.5414851 0.106391506 0.470362354
## 328 -2.1343494 -0.188402084 0.586705786
## 333 -4.1578776 1.175356875 0.490710135
## 335 -2.3885260 -0.227658149 -0.243059585
## 338 -2.0245245 -0.260831203 -0.030862461
## 343 -3.9742985 -0.263795962 1.698378001
## 344 -2.1272589 1.498081378 -1.134865629
## 345 -2.3528580 1.375485151 -1.393244593
## 348 -2.8414898 2.481300538 -1.768562811
## 351 -3.9719392 0.933938204 -0.223435941
## 358 5.8931569 -0.738000730 -0.685570219
## 359 5.8011804 -0.653406467 -0.621643918
## 376 5.7379115 0.487540685 -0.289145921
## 377 6.1698278 0.315268736 0.060514541
## 378 6.0650498 0.350142240 0.028633833
## 379 6.1810138 0.347228529 0.213510692
## 381 5.7500084 0.646245859 0.049027192
## 382 6.0343171 0.463793025 0.182746602
## 389 6.7297210 0.122810039 0.535758899
## 390 6.2793486 0.325900137 0.235616752
## 395 6.0095351 0.447478277 0.064280272
## 407 6.4913731 0.775007843 1.085626998
## 408 6.3221225 0.229605647 -0.138414686
## 418 6.7254942 0.368282822 0.277827964
## 423 6.0048907 0.821402653 0.551610249
## 428 6.2200369 0.665228383 -0.291847689
## 434 6.3979078 0.028566971 -0.774880700
## 438 7.0263223 -0.322114136 -0.806860903
## 441 6.3032099 -0.007025257 -0.261942844
## 446 6.8641157 -0.367484792 -0.979588941
## 447 6.3264012 -0.216968637 -0.763006984
## 452 6.1643786 -0.102970497 -0.569193932
## 453 6.1087292 -0.016409979 -0.417382274
## 471 5.6762221 0.885735965 0.893576593
## 473 5.5658891 0.874320387 0.710145488
## 474 5.6156011 0.456152383 -0.057514240
## 475 5.9011527 1.132745238 1.247858703
## 479 5.8799606 0.808780933 0.694290756
## 481 5.1906850 1.419150853 1.266486825
# Define a loss function (mean prediction error)
calc_class_err <- function(actual, predicted) {
  mean(actual != predicted)
}
# compute the proportion of wrong predictions in the test data
calc_class_err(actual = correct_classes, predicted = lda.pred$class)
## [1] 0.254902
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 17 8 1 0
## med_low 4 14 6 0
## med_high 0 6 17 1
## high 0 0 0 28
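The per-class accuracy can be read off the cross table by dividing the diagonal by the row sums; a sketch using the counts from the table above:

```r
# Per-class accuracy from the confusion table above
cls <- c("low", "med_low", "med_high", "high")
conf <- matrix(c(17,  8,  1,  0,
                  4, 14,  6,  0,
                  0,  6, 17,  1,
                  0,  0,  0, 28),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = cls, predicted = cls))
round(diag(conf) / rowSums(conf), 2)   # 'high' is classified perfectly here
1 - sum(diag(conf)) / sum(conf)        # overall error rate 0.254902
```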
The results show us that the high and low groups are assigned to the right classes much more reliably than the med_high and med_low groups. Overall the error rate is about 25 % here (it varies with how the observations are divided into train and test data), which is fairly high and could be better.
Next we standardise the data set again and look at distance measures and clustering: * Scale all variables to have mean = 0 and standard deviation = 1
# Bring data to R
library(MASS)
library(dplyr)
data(Boston)
# Standardise data
standardised_boston <- as.data.frame(Boston %>% mutate_all(~ (scale(.) %>% as.vector)))
# Calculate distances
## Euclidean distance matrix
dist_eu <- dist(standardised_boston, method = "euclidean")
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(standardised_boston, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
set.seed(13)
# K-means clustering, 4 centers (note: run here on the unscaled Boston data)
km <- kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
# Summary of clusters
summary(km$cluster)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 2.269 2.000 4.000
# K-means clustering, center 3
km <- kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)
pairs(Boston[6:10], col = km$cluster)
pairs(Boston[11:ncol(Boston)], col = km$cluster)
# Summary of clusters
summary(km$cluster)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.929 2.000 3.000
# K-means clustering, center 2
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)
pairs(Boston[6:10], col = km$cluster)
pairs(Boston[11:ncol(Boston)], col = km$cluster)
# Summary of clusters
summary(km$cluster)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.729 2.000 2.000
# K-means clustering, 5 centers
km <- kmeans(Boston, centers = 5)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)
pairs(Boston[6:10], col = km$cluster)
pairs(Boston[11:ncol(Boston)], col = km$cluster)
# Summary of clusters
summary(km$cluster)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 3.000 2.666 3.000 5.000
# K-means clustering, 6 centers
km <- kmeans(Boston, centers = 6)
# plot the Boston dataset with clusters
pairs(Boston[1:5], col = km$cluster)
pairs(Boston[6:10], col = km$cluster)
pairs(Boston[11:ncol(Boston)], col = km$cluster)
# Summary of clusters
summary(km$cluster)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 4.312 6.000 6.000
From the summaries we can see that the distances given by the two methods differ considerably, the Manhattan method giving the longer distances. Euclidean distance is the straight-line distance between two points, whereas Manhattan distance is the sum of the absolute coordinate-wise differences between them. Two clusters seem to work best: with 4 centers the cluster labels range from 1 to 4 and 2 is the most common, and when higher values are set for centers, cluster 4 seems to be the most common assignment. So it is hard to select a good number of clusters using summary() alone, but from the plots we can see that 4 clusters could be present in the data, although at least 2 and 3 clusters should also be tried.
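A common complement to eyeballing the plots is the elbow heuristic: run k-means for a range of k and look at where the total within-cluster sum of squares (WCSS) stops dropping sharply. A sketch, assuming the `standardised_boston` data frame built above (the nstart = 10 restarts are my addition, to avoid poor local optima):

```r
# Total within-cluster sum of squares for k = 1..10; the "elbow" where the
# curve flattens suggests a reasonable number of clusters
set.seed(13)
twcss <- sapply(1:10, function(k)
  kmeans(standardised_boston, centers = k, nstart = 10)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters k", ylab = "total WCSS")
```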
data(Boston)
# Standardise data
standardised_boston <- as.data.frame(Boston %>% mutate_all(~ (scale(.) %>% as.vector)))
# K-means clustering, center 3
km <- kmeans(standardised_boston, centers = 3)
# plot the Boston dataset with clusters
pairs(standardised_boston[1:5], col = km$cluster)
pairs(standardised_boston[6:10], col = km$cluster)
pairs(standardised_boston[11:ncol(standardised_boston)], col = km$cluster)
# Summary of clusters
summary(km$cluster)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.743 2.000 3.000
standardised_boston$N_clusters <- factor(km$cluster)
# Build up test and train datasets
# choose randomly 80% of the rows
ind <- sample(nrow(standardised_boston), size = floor(nrow(standardised_boston) * 0.8))
# create train set
train <- standardised_boston[ind,]
dim(train)
## [1] 404 15
# create test set
test <- standardised_boston[-ind,]
dim(test)
## [1] 102 15
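Note that sample() draws a different 80% on every knit; fixing the RNG seed first makes the split reproducible. A sketch, assuming standardised_boston from above (the seed value is arbitrary):

```r
set.seed(2022)  # any fixed constant gives a reproducible split
n <- nrow(standardised_boston)
ind <- sample(n, size = floor(n * 0.8))  # indices of the training rows
train <- standardised_boston[ind, ]
test  <- standardised_boston[-ind, ]
```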
# save the 'correct' clusters from test data
correct_classes <- test$N_clusters
# remove the N_clusters variable from test data
test <- dplyr::select(test, -N_clusters)
head(test)
## crim zn indus chas nox rm
## 3 -0.4169290 -0.48724019 -0.5927944 -0.2723291 -0.7395304 1.2814456
## 8 -0.4032966 0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.1603069
## 10 -0.4003331 0.04872402 -0.4761823 -0.2723291 -0.2648919 -0.3994130
## 11 -0.3939564 0.04872402 -0.4761823 -0.2723291 -0.2648919 0.1314594
## 21 -0.2745709 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -1.0171035
## 23 -0.2768170 -0.48724019 -0.4368257 -0.2723291 -0.1440749 -0.2030044
## age dis rad tax ptratio black lstat
## 3 -0.2655490 0.556609050 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324
## 8 0.9778406 1.023624897 -0.5224844 -0.5769480 -1.5037485 0.4406159 0.9097999
## 10 0.6154813 1.328320207 -0.5224844 -0.5769480 -1.5037485 0.3289995 0.6227277
## 11 0.9138948 1.211779950 -0.5224844 -0.5769480 -1.5037485 0.3926395 1.0918456
## 21 1.0488914 0.001356935 -0.6373311 -0.6006817 1.1753027 0.2179309 1.1716657
## 23 0.8215287 0.086363887 -0.6373311 -0.6006817 1.1753027 0.4406159 0.8495847
## medv
## 3 1.3229375
## 8 0.4965904
## 10 -0.3949946
## 11 -0.8190411
## 21 -0.9712629
## 23 -0.7972951
# linear discriminant analysis
lda.fit <- lda(N_clusters ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(N_clusters ~ ., data = train)
##
## Prior probabilities of groups:
## 1 2 3
## 0.4702970 0.3267327 0.2029703
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3724278 -0.3387499 -0.2829898 0.03849464 -0.3098577 -0.07912173
## 2 0.7707649 -0.4872402 1.1304030 0.05576262 1.1890969 -0.51987167
## 3 -0.4082840 1.5687708 -1.1015857 -0.08027540 -1.0112004 0.88743088
## age dis rad tax ptratio black
## 1 0.02723655 0.01214864 -0.5786988 -0.6052722 -0.1000423 0.2679608
## 2 0.82255145 -0.85370946 1.1680242 1.2695565 0.5468315 -0.5919408
## 3 -1.17655926 1.26178915 -0.6037174 -0.6590027 -0.7421679 0.3540631
## lstat medv
## 1 -0.1783253 0.06127032
## 2 0.8633549 -0.69424886
## 3 -0.9375719 0.95206252
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim -0.05143915 0.128168802
## zn -0.05754639 1.091843941
## indus 0.49017626 0.071189895
## chas 0.04698341 -0.068524065
## nox 1.06863590 0.719733618
## rm -0.15043419 0.380911844
## age -0.12108442 -0.593230736
## dis -0.12373888 0.564951427
## rad 0.69328145 0.007068937
## tax 0.92825702 0.816155360
## ptratio 0.24143922 0.085061886
## black 0.01470321 -0.013028639
## lstat 0.17568563 0.484228519
## medv -0.06969142 0.626189685
##
## Proportion of trace:
## LD1 LD2
## 0.8378 0.1622
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$N_clusters)
# plot the lda results
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
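The test set and saved cluster labels built above can now be used to check how well the LDA recovers the k-means clusters on held-out rows; a sketch, assuming lda.fit, test and correct_classes from the previous chunks:

```r
# Predict cluster membership for the held-out rows and cross-tabulate
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
# Counts on the diagonal are correctly classified observations.
```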
The plot of the LDA results shows that tax and rad have the largest effect on LD1 and age on LD2, making them the most influential linear separators for the clusters.
# Add a crime factor to the data set scaled in the previous code chunk
# create a quantile vector of crim and print it
bins <- quantile(standardised_boston[['crim']])
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(standardised_boston[['crim']], breaks = bins, include.lowest = TRUE)
# remove original crim from the dataset
standardised_boston <- dplyr::select(standardised_boston, -crim)
# add the new categorical value to scaled data
standardised_boston <- data.frame(standardised_boston, crime)
# Convert crime to character for a moment
standardised_boston$crime<-as.character(standardised_boston$crime)
# Set 'crime' to factor variable, and give relevant levels
standardised_boston$crime[standardised_boston$crime=="[-0.419,-0.411]"] <- "low"
standardised_boston$crime[standardised_boston$crime=="(-0.411,-0.39]"] <- "med_low"
standardised_boston$crime[standardised_boston$crime=="(-0.39,0.00739]"] <- "med_high"
standardised_boston$crime[standardised_boston$crime=="(0.00739,9.92]"] <- "high"
# back to factor with correct levels
standardised_boston$crime<- factor(standardised_boston$crime, levels = c("low", "med_low", "med_high", "high"))
# Check head of edited table
head(standardised_boston)
## zn indus chas nox rm age dis
## 1 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948 0.140075
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034 0.556609
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490 0.556609
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878 1.076671
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743 1.076671
## 6 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100 1.076671
## rad tax ptratio black lstat medv N_clusters
## 1 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990 0.1595278 1
## 2 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239 1
## 3 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324 1.3229375 1
## 4 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708 1.1815886 3
## 5 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.0254866 1.4860323 3
## 6 -0.7521778 -1.1050216 0.1129203 0.4101651 -1.0422909 0.6705582 1
## crime
## 1 low
## 2 low
## 3 low
## 4 low
## 5 low
## 6 low
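As a side note, the character round trip above can be avoided: cut() accepts a labels argument that names the factor levels directly. A sketch, where scaled_crim stands for the standardised crim vector and bins for its quantiles, as computed above:

```r
# One-step version: cut() attaches the factor labels itself
crime <- cut(scaled_crim, breaks = bins,
             labels = c("low", "med_low", "med_high", "high"),
             include.lowest = TRUE)
table(crime)  # counts per crime category
```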
# Build up test and train datasets
# choose randomly 80% of the rows
n <- nrow(standardised_boston)
ind <- sample(n, size = n * 0.8)
# create train set
train <- standardised_boston[ind,]
dim(train)
## [1] 404 15
# create test set
test <- standardised_boston[-ind,]
dim(test)
## [1] 102 15
# Save n_cluster column
n_clusterit<- train$N_clusters
# remove original n_cluster from the dataset
train <- dplyr::select(train, -N_clusters)
# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2500000 0.2574257 0.2425743
##
## Group means:
## zn indus chas nox rm age
## low 1.08456771 -0.9369443 -0.11640431 -0.8924742 0.5212608 -0.8766213
## med_low -0.09115733 -0.3312543 0.03952046 -0.5917217 -0.1115500 -0.3625932
## med_high -0.39571399 0.1510421 0.14409500 0.3846657 0.1429959 0.4294170
## high -0.48724019 1.0149946 -0.03128211 1.0406720 -0.4855777 0.7957547
## dis rad tax ptratio black lstat
## low 0.8529692 -0.6987343 -0.7602375 -0.49441664 0.3799540 -0.79526046
## med_low 0.3642410 -0.5554602 -0.5409951 -0.04120056 0.3201782 -0.17644351
## med_high -0.3749755 -0.3822389 -0.2825025 -0.26326606 0.1074278 -0.05752024
## high -0.8584374 1.6596029 1.5294129 0.80577843 -0.7464259 0.89613928
## medv
## low 0.5929402
## med_low 0.0102126
## med_high 0.2096063
## high -0.6231054
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11030024 0.90477115 -0.9432520
## indus -0.02369693 -0.21284527 0.3534688
## chas -0.06681638 -0.04469796 0.1555483
## nox 0.32271479 -0.76681432 -1.3321508
## rm -0.11928537 -0.08149308 -0.1426609
## age 0.35340567 -0.29789983 -0.2669457
## dis -0.08563196 -0.47438844 0.2270172
## rad 3.09835255 1.17955576 0.1675980
## tax 0.10830898 -0.36338770 0.2630112
## ptratio 0.15263545 0.06163493 -0.2299144
## black -0.17731254 0.01064261 0.1035738
## lstat 0.14115142 -0.15376102 0.3960080
## medv 0.18691351 -0.36944033 -0.1857561
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9436 0.0416 0.0148
# Collect crime variables
target<- factor(train$crime)
# Collect predictor data
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# Next, access the plotly package. Create a 3D plot of the columns of the matrix.
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=target)
# Let's colorcode clusters
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=n_clusterit)
Data wrangling was started last week, and the script can be found in the file create_human.R. There should be 195 observations and 19 variables. Most of the variable names have been shortened, and a couple of new variables have been added to the original dataset as follows: * “GNI” = Gross National Income per capita * “Life.Exp” = Life expectancy at birth * “Edu.Exp” = Expected years of schooling * “Mat.Mor” = Maternal mortality ratio * “Ado.Birth” = Adolescent birth rate
New variables: * “Edu2.FM” = Edu2.F / Edu2.M * “Labo.FM” = Labo2.F / Labo2.M
There is still some editing to do, e.g. the Gender Inequality Index could also be shortened to ‘GII’. Let’s start wrangling with that.
# Read file to R-environment
human_data<-read.table("human_data.tsv", header=TRUE, sep=";", dec=",")
# EDIT Gender Inequality Index column
colnames(human_data)[colnames(human_data)=="Gender Inequality Index"]<-"GII"
# Libraries
library(dplyr)
# Change GNI col to numeric by mutate
human_data$GNI<-human_data %>%
dplyr:::select(GNI) %>%
mutate(GNI =as.numeric(GNI))
# Exclude unneeded variables
keep <- c("Country", "Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")
human_data.subset <- human_data[,keep]
# Editing 'GNI' the previous way turned the column into class 'data.frame', which would be problematic in plots,
# so let's change it back to a plain numeric vector
human_data.subset$GNI<-as.numeric(unlist(human_data.subset$GNI))
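The data-frame-in-a-column problem above can be avoided entirely by mutating the column in place, which keeps it an atomic numeric vector. A sketch:

```r
library(dplyr)
# Mutate the column directly instead of assigning a selected data frame into it
human_data <- human_data %>%
  mutate(GNI = as.numeric(GNI))
class(human_data$GNI)  # "numeric"
```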
# Remove rows with all NA
human_data.subset<-human_data.subset[rowSums(is.na(human_data.subset))!= ncol(human_data.subset),]
dim(human_data.subset)
## [1] 195 9
# It seems there are no rows that are entirely NA
# Let's remove rows that contain even one NA
human_data.subset<-human_data.subset[complete.cases(as.matrix(human_data.subset)), ]
dim(human_data.subset)
## [1] 162 9
# After subsetting, 162/195 observations remain with no missing values
# Next we should remove the observations that refer to regions rather than countries
# To find out which ones those are, take the unique values of the Country column
paste(unique(human_data.subset$Country),collapse=', ')
## [1] "Afghanistan, Albania, Algeria, Arab States, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Benin, Bhutan, Bolivia (Plurinational State of), Bosnia and Herzegovina, Botswana, Brazil, Bulgaria, Burkina Faso, Burundi, Cambodia, Cameroon, Canada, Central African Republic, Chad, Chile, China, Colombia, Congo, Congo (Democratic Republic of the), Costa Rica, Côte d'Ivoire, Croatia, Cuba, Cyprus, Czech Republic, Denmark, Dominican Republic, East Asia and the Pacific, Ecuador, Egypt, El Salvador, Estonia, Ethiopia, Europe and Central Asia, Fiji, Finland, France, Gabon, Gambia, Georgia, Germany, Ghana, Greece, Guatemala, Guyana, Haiti, Honduras, Hungary, Iceland, India, Indonesia, Iran (Islamic Republic of), Iraq, Ireland, Israel, Italy, Jamaica, Japan, Jordan, Kazakhstan, Kenya, Korea (Republic of), Kuwait, Kyrgyzstan, Latin America and the Caribbean, Latvia, Lebanon, Lesotho, Liberia, Libya, Lithuania, Luxembourg, Malawi, Malaysia, Maldives, Mali, Malta, Mauritania, Mauritius, Mexico, Moldova (Republic of), Mongolia, Montenegro, Morocco, Mozambique, Myanmar, Namibia, Nepal, Netherlands, New Zealand, Nicaragua, Niger, Norway, Oman, Pakistan, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Qatar, Romania, Russian Federation, Rwanda, Samoa, Saudi Arabia, Senegal, Serbia, Sierra Leone, Singapore, Slovakia, Slovenia, South Africa, South Asia, Spain, Sri Lanka, Sub-Saharan Africa, Sudan, Suriname, Swaziland, Sweden, Switzerland, Syrian Arab Republic, Tajikistan, Tanzania (United Republic of), Thailand, The former Yugoslav Republic of Macedonia, Togo, Tonga, Trinidad and Tobago, Tunisia, Turkey, Uganda, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Venezuela (Bolivarian Republic of), Viet Nam, World, Yemen, Zambia, Zimbabwe"
# From the print:
regions<-c("South Africa","South Asia","Latin America and the Caribbean","Europe and Central Asia","East Asia and the Pacific","World","Sub-Saharan Africa")
# Subset regions from the human_data
human_data.subset <- subset(human_data.subset, !(Country %in% regions))
# We can check from the dimensions and unique values that the correct rows have been removed
dim(human_data.subset)
## [1] 155 9
# add country to rownames
rownames(human_data.subset) <- human_data.subset$Country
human_data.subset <- human_data.subset[,!(colnames(human_data.subset) %in% c('Country'))]
# Check dimensions, should be 155 x 8
dim(human_data.subset)
## [1] 155 8
write.csv2(human_data.subset, "human_data_30112022.tsv", row.names=TRUE)
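Note that write.csv2() writes semicolon-separated values even though the file above is named .tsv; reading it back with the matching read.csv2() is a quick sanity check. A sketch:

```r
# Read the saved file back and confirm the dimensions match the saved table
check <- read.csv2("human_data_30112022.tsv", row.names = 1)
dim(check)  # should be 155 x 8
```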
This week's topic is Dimensionality reduction techniques.
# Libraries
library(tidyverse)
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.2.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
library(ggpubr)
library(MASS)
library(ggplot2)
library(GGally)
library(reshape2)
library(tidyr)
library(corrplot)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.2.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.2.2
# Let's try a new package for producing a summary table and a plot of the data
library(modelsummary)
## Warning: package 'modelsummary' was built under R version 4.2.2
# head of data
head(human_data.subset)
## Edu2.FM Labo.FM Edu.Exp Life.Exp GNI Mat.Mor Ado.Birth
## Afghanistan 0.1987421 0.1979866 9.3 60.4 1885 400 86.8
## Albania 0.6854962 0.9306030 11.8 77.8 9943 21 15.3
## Algeria 0.2105263 0.8612903 14.0 74.8 13054 89 10.0
## Arab States 0.3081009 0.7289916 12.0 70.6 15722 155 45.4
## Argentina 0.6333333 0.9774306 17.9 76.3 22050 69 54.4
## Armenia 0.7465565 0.9894737 12.3 74.7 8124 29 27.1
## Parli.F
## Afghanistan 27.6
## Albania 20.7
## Algeria 25.7
## Arab States 14.0
## Argentina 36.8
## Armenia 10.7
# Summary of the Human Data
datasummary_skim(human_data.subset)
| Unique (#) | Missing (%) | Mean | SD | Min | Median | Max | ||
|---|---|---|---|---|---|---|---|---|
| Edu2.FM | 155 | 0 | 0.7 | 0.2 | 0.2 | 0.8 | 1.0 | |
| Labo.FM | 151 | 0 | 0.9 | 0.2 | 0.2 | 0.9 | 1.5 | |
| Edu.Exp | 85 | 0 | 13.2 | 2.8 | 5.4 | 13.5 | 20.2 | |
| Life.Exp | 112 | 0 | 71.7 | 8.3 | 49.0 | 74.2 | 83.5 | |
| GNI | 155 | 0 | 17651.1 | 18539.2 | 581.0 | 12040.0 | 123124.0 | |
| Mat.Mor | 88 | 0 | 149.2 | 211.8 | 1.0 | 49.0 | 1100.0 | |
| Ado.Birth | 139 | 0 | 47.1 | 41.1 | 0.6 | 33.6 | 204.8 | |
| Parli.F | 123 | 0 | 20.7 | 11.4 | 0.0 | 19.3 | 57.5 |
# Summary plot
## Scatterplot
ggpairs(human_data.subset, mapping = aes(), lower =list(combo = wrap("facethist", bins = 20)))
# Correlation plot
datasummary_correlation(human_data.subset)
| Edu2.FM | Labo.FM | Edu.Exp | Life.Exp | GNI | Mat.Mor | Ado.Birth | Parli.F | |
|---|---|---|---|---|---|---|---|---|
| Edu2.FM | 1 | . | . | . | . | . | . | . |
| Labo.FM | .02 | 1 | . | . | . | . | . | . |
| Edu.Exp | .05 | .59 | 1 | . | . | . | . | . |
| Life.Exp | −.14 | .59 | .80 | 1 | . | . | . | . |
| GNI | −.02 | .43 | .62 | .63 | 1 | . | . | . |
| Mat.Mor | .24 | −.66 | −.74 | −.87 | −.50 | 1 | . | . |
| Ado.Birth | .12 | −.53 | −.70 | −.74 | −.56 | .76 | 1 | . |
| Parli.F | .26 | .08 | .21 | .19 | .09 | −.09 | −.07 | 1 |
There are differences between the variables in value scale and distribution. E.g. our new variables combine information from two columns into ratios, while GNI has the highest values and the widest range. Overall, the values range from 0 to 123124, meaning that some sort of scaling and normalisation should be done to make the variables comparable.
Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data. Formally, PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by linearly transforming the data into a new coordinate system where (most of) the variation in the data can be described with fewer dimensions than the initial data. (wikipedia)
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_data.subset)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
From this plot we cannot see much or use it as such for a good interpretation. So far we can only say that GNI seems to dominate PC1.
# standardize the variables
human_std <- scale(human_data.subset)
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
After scaling we can see much better how the different variables influence PC1 and PC2. Edu.Exp, Life.Exp, Labo.FM and GNI influence PC1 and form one group, while Mat.Mor and Ado.Birth form another group whose influence on PC1 is in the opposite direction. Reading the plot, we can see for example that Niger and Sierra Leone have high Mat.Mor and Ado.Birth and low GNI, Labo.FM, Life.Exp and Edu.Exp, which sounds realistic. The values for e.g. Japan, Korea and the Czech Republic are on the opposite side of the mean/median.
There could also be negative correlation between these groups, and positive correlation among the variables within a group. This could be studied further, but we can already validate it from the earlier summary plots where the correlation values can be read. Overall, these groups of similarly influencing variables divide the countries into groups with a similar ‘standard of living’. PC2 is influenced by Parli.F and Edu2.FM, which also correlate positively with each other. It shows that countries such as Bolivia, Rwanda and Namibia seem to have a similar education level between genders, and this correlates positively with the percentage of female representatives in parliament. In contrast, e.g. Jordan has a large gap in education level between genders, and the percentage of female representatives in parliament is also low.
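The visual reading above can be backed up numerically: summary() on a prcomp object reports the proportion of variance each component captures. A sketch, assuming pca_human fitted on the standardised data:

```r
# Percentage of total variance explained by each principal component
s <- summary(pca_human)
round(100 * s$importance["Proportion of Variance", ], 1)
```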
In statistics, multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space. The procedure thus appears to be the counterpart of principal component analysis for categorical data. MCA can be viewed as an extension of simple correspondence analysis (CA) in that it is applicable to a large set of categorical variables.(wikipedia)
# Bring Tea data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# Study data
View(tea)
# Dimensions
dim(tea)
## [1] 300 36
# Colnames
colnames(tea)
## [1] "breakfast" "tea.time" "evening" "lunch"
## [5] "dinner" "always" "home" "work"
## [9] "tearoom" "friends" "resto" "pub"
## [13] "Tea" "How" "sugar" "how"
## [17] "where" "price" "age" "sex"
## [21] "SPC" "Sport" "age_Q" "frequency"
## [25] "escape.exoticism" "spirituality" "healthy" "diuretic"
## [29] "friendliness" "iron.absorption" "feminine" "sophisticated"
## [33] "slimming" "exciting" "relaxing" "effect.on.health"
# Summary
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 +60 :38 +2/day :127
## middle :40 sportsman :179 15-24:92 1 to 2/week: 44
## non-worker :64 25-34:69 1/day : 95
## other worker:20 35-44:40 3 to 6/week: 34
## senior :35 45-59:61
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
# Summary plot
# Let's visualize part of the data
keep_columns_2<- c("price","Tea")
# Mosaic plot of couple variables
mosaicplot(table(tea[,colnames(tea) %in% keep_columns_2]), xlab='Tea', ylab='Price',main='Tea and price', col='blue')
# Age would have to be binned into a factor or removed --> remove it
tea<-tea[!(colnames(tea) %in% 'age')]
# Multiple Correspondence Analysis (MCA)
# Could also be set to visualize the correlation between variables and the MCA principal dimensions
res.mca<- MCA(tea, graph = FALSE)
# With so many variables the printed MCA factor plots become harder to read.
# The MCA object can be used for many different kinds of visualizations -->
# extract and save the results for later use
var <- get_mca_var(res.mca)
var
## Multiple Correspondence Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for categories"
## 2 "$cos2" "Cos2 for categories"
## 3 "$contrib" "contributions of categories"
# Coordinates
head(var$coord)
## Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
## breakfast 0.1818793 0.019872372 -0.10740185 -0.55341901 -0.09826250
## Not.breakfast -0.1678886 -0.018343728 0.09914017 0.51084831 0.09070384
## Not.tea time -0.5562500 0.004261602 0.06242859 0.09798477 0.16970005
## tea time 0.4311760 -0.003303372 -0.04839139 -0.07595269 -0.13154265
## evening 0.2760859 -0.408578586 0.34396723 0.23142519 -0.26420301
## Not.evening -0.1443495 0.213622307 -0.17984073 -0.12099896 0.13813660
# Cos2: quality on the factor map
head(var$cos2)
## Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
## breakfast 0.03053547 3.645334e-04 0.010647837 0.282713168 0.008912786
## Not.breakfast 0.03053547 3.645334e-04 0.010647837 0.282713168 0.008912786
## Not.tea time 0.23984168 1.407766e-05 0.003021007 0.007442207 0.022322794
## tea time 0.23984168 1.407766e-05 0.003021007 0.007442207 0.022322794
## evening 0.03985286 8.728150e-02 0.061859319 0.028002208 0.036496104
## Not.evening 0.03985286 8.728150e-02 0.061859319 0.028002208 0.036496104
# Contributions to the principal components
head(var$contrib)
## Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
## breakfast 0.5036943 0.0066328082 0.22530472 6.7101105 0.2373656
## Not.breakfast 0.4649486 0.0061225922 0.20797359 6.1939482 0.2191067
## Not.tea time 4.2859707 0.0002774934 0.06925046 0.1913584 0.6440432
## tea time 3.3222613 0.0002150984 0.05367935 0.1483310 0.4992288
## evening 0.8301633 2.0055059548 1.65293457 0.8393006 1.2274180
## Not.evening 0.4340448 1.0485640271 0.86422467 0.4388221 0.6417465
# visualize MCA - many possibilites
# Eigenvalues/Variances
# In linear algebra, an eigenvector or characteristic vector of a linear transformation is a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it. The corresponding eigenvalue, often denoted by λ, is the factor by which the eigenvector is scaled.
# Geometrically, an eigenvector, corresponding to a real nonzero eigenvalue, points in a direction in which it is stretched by the transformation, and the eigenvalue is the factor by which it is stretched. If the eigenvalue is negative, the direction is reversed. Loosely speaking, in a multidimensional vector space, the eigenvector is not rotated. (wikipedia)
# get values
eig.val <- get_eigenvalue(res.mca)
# The data frame also contains the variance % and cumulative variance %. These can be used to visualize how much of the variance is explained by each dimension, e.g. in a bar plot
head(eig.val)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 0.09006849 5.837772 5.837772
## Dim.2 0.08165357 5.292361 11.130133
## Dim.3 0.07021443 4.550936 15.681069
## Dim.4 0.06259673 4.057196 19.738264
## Dim.5 0.05578674 3.615807 23.354071
## Dim.6 0.05345502 3.464677 26.818748
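The eigenvalue idea quoted above can be illustrated directly with base R's eigen() on a toy matrix (not part of the MCA itself):

```r
# Toy illustration: a diagonal matrix stretches each axis by its diagonal entry
m <- matrix(c(2, 0, 0, 3), nrow = 2)
eigen(m)$values  # 3 2: the diagonal entries, returned in decreasing order
```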
# Build plot
fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 45))
# Biplot
# The plot shows a global pattern within the data. Rows (individuals) are represented by blue points and columns (variable categories) by red triangles.
# The distance between any row points or column points gives a measure of their similarity (or dissimilarity). Row points with similar profiles are close on the factor map. The same holds true for column points.
fviz_mca_biplot(res.mca,
repel = TRUE,
ggtheme = theme_minimal())
## Warning: ggrepel: 279 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 59 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# Contribution of variable categories to the dimensions 1-2
# The red dashed line on the graph indicates the expected average value if the contributions were uniform.
fviz_contrib(res.mca, choice = "var", axes = 1:2, top = 15)
# Cos2
# The quality of the representation is called the squared cosine (cos2), which measures the degree of association between variable categories and a particular axis.
# Color by cos2 values: quality on the factor map
fviz_mca_var(res.mca, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE, # Avoid text overlapping
ggtheme = theme_minimal())
## Warning: ggrepel: 61 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# Contribution
# The plot above gives an idea of what pole of the dimensions the categories are actually contributing to.
# Color by contribution values: quality on the factor map
fviz_mca_var(res.mca, col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE, # avoid text overlapping (slow)
ggtheme = theme_minimal()
)
## Warning: ggrepel: 61 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# From the plots we can read many things and make assumptions, maybe even conclusions, but further tests should be done. We can see that e.g. tea shop, non-worker and the older age groups seem to have a positive influence on Dim 2, whereas student, which correlates positively with the age group 15-24, has a negative influence on Dim 2. Could this indicate age- and civil-status-related differences in tea drinking habits? Also, the location where tea is drunk seems to correlate with lunch tea consumption along Dim 1. Could this indicate that workers typically like to drink their lunch tea in a pub, chain store + tea shop or tearoom rather than at home?
(more chapters to be added similarly as we proceed with the course!)